Abstract

Equilibrium statistical physics is applied to layered neural networks with differentiable 
activation functions. A first analysis of off-line learning in soft-committee machines with a 
finite number (K) of hidden units learning a perfectly matching rule is performed. Our results 
are exact in the limit of high training temperatures ($\beta \to 0$). For $K = 2$ we find a second order
phase transition from unspecialized to specialized student configurations at a critical size $P$
of the training set, whereas for $K \geq 3$ the transition is first order. Monte Carlo simulations
indicate that our results are also qualitatively valid for moderately low temperatures. The
limit $K \to \infty$ can be performed analytically; the transition occurs after presenting on the order
of $NK/\beta$ examples. However, an unspecialized metastable state persists up to $P \propto NK^2/\beta$.



Statistical physics provides tools for the investigation of learning processes in adaptive systems 
such as feedforward neural networks [1]. The by now standard analysis of off-line or batch learning
from a fixed set of example data is based on the interpretation of training as a stochastic process
which leads to a properly defined thermal equilibrium. It has been applied with great success
to simple networks like single layer perceptrons or specific multilayer architectures with binary
threshold units. See e.g. [§, ||, f| for reviews and || for a discussion focussed on phase transitions 
in neural networks. 

In a somewhat different framework, the theory of on-line learning, it is assumed that a tem- 
poral sequence of independent examples is provided by the environment. For large networks, the 
inherently stochastic learning dynamics is described exactly by a set of deterministic differential 
equations, see e.g. Q for an up to date overview. This approach has made possible the recent 
progress with respect to layered networks with continuous node activations, see e.g. [6-10] and 
references therein. Such systems are relevant for applications as they can implement non-trivial 
regression schemes and practical training algorithms are available § @. 

In this Letter we present an analysis of off-line learning in two-layered architectures with 
differentiable transfer functions by means of equilibrium statistical physics. The considered model 
exhibits phase transitions in the learning process, i.e. a discontinuous dependence of the student 
performance on the number of examples. These transitions are the counterparts of quasi-stationary 
plateau states observed in on-line dynamics |^|, ^] and are due to the same inherent symmetries. 
The hidden unit specialization studied here is different from the sudden achievement of perfect 
generalization described in [13] for single, continuous nodes.

Specifically, we will investigate in the following the learning of a rule in a fully connected 
two-layered neural network with total output 

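$$\sigma(\xi) = \frac{1}{\sqrt{K}} \sum_{j=1}^{K} g(x_j) \quad \text{where} \quad x_j = \frac{1}{\sqrt{N}} \, J^{(j)} \cdot \xi \qquad (1)$$

upon presentation of an $N$-dimensional input vector $\xi$. Given the hidden unit activation $g(x)$, the
adaptive weights $J^{(j)} \in \mathbb{R}^N$ define the input-output relation. The term soft-committee machine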
has been coined for this type of network |Q, ^J, as it can be interpreted as a continuous version 
of the thoroughly studied (hard) committee of binary hidden units (see [14-17] and references 
therein). Here, the weights of the linear hidden-to-output relation are fixed to the particular
value $1/\sqrt{K}$ which differs from the usual scaling considered in the analysis of on-line learning in
soft-committees J|, ^] . 

Off-line learning in networks of type (1) has been studied before in the limit $K \to \infty$ by
Kang et al. [14]. Here we will focus on finite $K$, but include a discussion of large machines for
completeness. Note that the specific form of Eq. (1) yields outputs of finite magnitude in the
limit $K \to \infty$. Most frequently $g(x)$ is taken to be a sigmoidal function of its argument, e.g. the
hyperbolic tangent. We choose the similar but more convenient function $g(x) = \mathrm{erf}(x/\sqrt{2})$ which
simplifies the mathematical treatment to a great extent yet should not alter the basic features of
the model otherwise j?| ||. 
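
As an illustration of Eq. (1) with the activation $g(x) = \mathrm{erf}(x/\sqrt{2})$, the following minimal Python sketch evaluates the network output for a single input. The function name `soft_committee_output` and the plain-list representation of the weight vectors are choices made here for illustration only; they are not taken from the original work.

```python
import math

def soft_committee_output(weights, xi):
    """Output sigma(xi) of a soft-committee machine.

    weights: list of K weight vectors J^(j), each a list of N floats
    xi:      input vector, a list of N floats
    """
    K, N = len(weights), len(xi)
    total = 0.0
    for J in weights:
        # hidden-unit field x_j = J^(j) . xi / sqrt(N)
        x = sum(Ji * xii for Ji, xii in zip(J, xi)) / math.sqrt(N)
        # activation g(x) = erf(x / sqrt(2))
        total += math.erf(x / math.sqrt(2.0))
    # hidden-to-output weights are fixed to 1 / sqrt(K)
    return total / math.sqrt(K)

# Two hidden units with opposite weight vectors cancel, because erf is odd.
print(soft_committee_output([[1.0, 1.0], [-1.0, -1.0]], [0.3, 0.7]))
```

Because every hidden unit contributes at most 1 in magnitude, the output is bounded by $\sqrt{K}$ in this sketch.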

In this Letter, we restrict our analysis to scenarios in which the unknown rule $\tau(\xi)$ can be
parametrized through a teacher network of perfectly matching architecture and size ($K$) with
weight vectors $B^{(j)}$. Further, we assume that the $B^{(j)}$ are normalized vectors of length $\sqrt{N}$ with
i.i.d. random components and accordingly impose a normalization $(J^{(j)})^2 = N$ on the student
vectors. Since the continuous student output, Eq. (1), depends explicitly on the length of the
weight vectors, this latter constraint corresponds to significant a priori knowledge of the rule's 
structure. We will discuss later how this restriction could be relaxed. 

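The off-line training process is based on a fixed set of examples $\{\xi^\mu, \tau(\xi^\mu)\}$ ($\mu = 1, 2, \ldots, P$)
and is guided by the minimization of the cost function or training error

$$\epsilon_t = \frac{1}{P} \sum_{\mu=1}^{P} \frac{1}{2} \left[ \sigma(\xi^\mu) - \tau(\xi^\mu) \right]^2 . \qquad (2)$$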



Thus, learning is formulated as an optimization problem. In networks with differentiable activation 
functions the (approximate) solution could be found by use of gradient descent or similar methods.
The prominent backpropagation of error, for example, is widely used in practice fj]. Often,
the actual global minimum of (2) is not identified by such algorithms, even if the rule is perfectly
learnable. Practical learning prescriptions can be trapped in local minima of the training error or 
stop as soon as a satisfactory performance is achieved. 
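
The cost function can be sketched in a few lines of Python. The snippet assumes that the quadratic deviation carries a prefactor 1/2 (the normalization consistent with the limiting value $1/3 - 1/\pi$ quoted later for large $K$) and that the rule is given by a teacher network of the same architecture. The names `sigma`, `random_weight_vector` and `training_error`, as well as the sizes used in the example, are illustrative assumptions.

```python
import math
import random

def sigma(weights, xi):
    """Soft-committee output (1/sqrt(K)) * sum_j erf(x_j / sqrt(2)), x_j = J_j . xi / sqrt(N)."""
    K, N = len(weights), len(xi)
    fields = [sum(w * x for w, x in zip(J, xi)) / math.sqrt(N) for J in weights]
    return sum(math.erf(x / math.sqrt(2.0)) for x in fields) / math.sqrt(K)

def random_weight_vector(rng, N):
    """Random vector rescaled to length sqrt(N), matching the normalization of the model."""
    v = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(sum(c * c for c in v))
    return [math.sqrt(N) * c / norm for c in v]

def training_error(student, teacher, inputs):
    """Mean of 0.5 * (student output - teacher output)^2 over the example inputs."""
    errors = [0.5 * (sigma(student, xi) - sigma(teacher, xi)) ** 2 for xi in inputs]
    return sum(errors) / len(errors)

# Illustrative sizes: K = 2 hidden units, N = 20 input components, P = 50 examples.
rng = random.Random(0)
N, K, P = 20, 2, 50
teacher = [random_weight_vector(rng, N) for _ in range(K)]
student = [random_weight_vector(rng, N) for _ in range(K)]
inputs = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(P)]
print(training_error(student, teacher, inputs))
```

A student that coincides with the teacher gives zero training error, which is the sense in which the rule is perfectly learnable.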

After training, the quadratic error measure can also be utilized to quantify the success of
learning in terms of the so-called generalization error

$$\epsilon_g = \frac{1}{2} \left\langle \left[ \sigma(\xi) - \tau(\xi) \right]^2 \right\rangle . \qquad (3)$$

Here $\langle \ldots \rangle$ denotes an empirical average over a test set of (new) examples or over the distribution
of random inputs, which is assumed to be known in the model. Throughout the following we take 
the components of all inputs $\xi$ to be i.i.d. random variables with zero mean and unit variance. In
the thermodynamic limit $N \to \infty$ the quantities $x_j = J^{(j)} \cdot \xi / \sqrt{N}$ and $y_j = B^{(j)} \cdot \xi / \sqrt{N}$ become
zero mean correlated Gaussian variables by means of the central limit theorem. Their joint density
is fully characterized by the covariances

$$\langle x_i y_j \rangle = J^{(i)} \cdot B^{(j)} / N = R_{ij} , \quad \langle x_i x_j \rangle = J^{(i)} \cdot J^{(j)} / N = Q_{ij} , \quad \langle y_i y_j \rangle = B^{(i)} \cdot B^{(j)} / N = \delta_{ij} , \qquad (4)$$

where all diagonal $Q_{ii} = 1$. Hence, the average in Eq. (3) reduces to a $2K$-dimensional Gaussian
integral which can be performed analytically for the specific activation function $g(x) = \mathrm{erf}(x/\sqrt{2})$.
The result depends only on the order parameters $\{R_{ij}, Q_{ij}\}$,

$$\epsilon_g = \frac{1}{3} + \frac{1}{\pi K} \left[ \sum_{i \neq j} \arcsin\frac{Q_{ij}}{2} - 2 \sum_{i,j} \arcsin\frac{R_{ij}}{2} \right] , \qquad (5)$$

and was derived in []S[ for arbitrary network sizes and more general scenarios.
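
A minimal sketch of how the generalization error can be evaluated from the order parameters is given below. It assumes the standard Gaussian identity $\langle \mathrm{erf}(x/\sqrt{2}) \, \mathrm{erf}(y/\sqrt{2}) \rangle = (2/\pi) \arcsin[\langle xy \rangle / \sqrt{(1 + \langle x^2 \rangle)(1 + \langle y^2 \rangle)}]$, a prefactor 1/2 in the quadratic error, unit diagonal $Q_{ii} = 1$ and orthonormal teacher vectors. The function name `generalization_error` is hypothetical.

```python
import math

def generalization_error(Q, R):
    """eps_g from student-student overlaps Q and student-teacher overlaps R.

    Assumes Q[i][i] = 1 and teacher overlaps equal to the identity, so that
    eps_g = 1/3 + (1/(pi K)) * [sum_{i != j} asin(Q_ij/2) - 2 sum_{i,j} asin(R_ij/2)].
    """
    K = len(Q)
    total = 0.0
    for i in range(K):
        for j in range(K):
            if i != j:
                total += math.asin(Q[i][j] / 2.0)   # student-student cross terms
            total -= 2.0 * math.asin(R[i][j] / 2.0)  # student-teacher terms
    return 1.0 / 3.0 + total / (math.pi * K)

# Example: K = 2 student with partial, symmetric overlap with the teacher.
Q = [[1.0, 0.1], [0.1, 1.0]]
R = [[0.5, 0.5], [0.5, 0.5]]
print(generalization_error(Q, R))
```

In this sketch a student identical to the teacher ($Q = R = 1$, the unit matrix) gives zero error, while a student uncorrelated with the teacher and with mutually orthogonal weight vectors gives 1/3.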

Following the standard statistical physics approach to off-line learning we consider a Gibbs 
ensemble of networks which is characterized by the partition function 



$$Z = \int d\mu(\{J^{(j)}\}) \, \exp[-\beta P \epsilon_t] . \qquad (6)$$



The extensive energy $P \epsilon_t$ is defined in Eq. (2) and the measure $d\mu$ limits the integration to the
region in weight space where all $(J^{(j)})^2 = N$. The formal temperature $1/\beta$ controls the thermal
average of the energy in equilibrium. Equivalently, it fixes the amount of noise which is present
in a corresponding stochastic training process. 

Typical properties of the equilibrium state can be calculated from the associated free energy 
on average over the quenched randomness contained in the training data $\mathbb{D} = \{\xi^\mu, \tau(\xi^\mu)\}$
(denoted as $\langle \ldots \rangle_{\mathbb{D}}$). The evaluation of the quenched free energy $-\langle \ln Z \rangle_{\mathbb{D}} / \beta$ requires, in general,
the application of the replica method. It becomes rather involved in models of the complexity 
considered here, see e.g. [15-17] for a treatment of networks with binary units. In order to obtain 
first results for the off-line training of soft-committee machines we resort to the simplifying limit 
of high temperatures. This strategy has proven useful for gaining first insights in a variety of 
models, see || ||, [l^] for details and example applications. 

In the limit $\beta \to 0$, we can replace $\langle \ln Z \rangle_{\mathbb{D}}$ with $\ln \langle Z \rangle_{\mathbb{D}}$, i.e. the annealed approximation,
which circumvents the replica formalism, becomes exact. Further, the average over the training 
data factorizes with respect to the example inputs and one obtains 



$$-\langle \ln Z \rangle_{\mathbb{D}} / N = \beta f(\{Q_{ij}, R_{ij}\}) = (\beta P / N) \, \epsilon_g(\{Q_{ij}, R_{ij}\}) - s(\{Q_{ij}, R_{ij}\}) \qquad (7)$$

where the r.h.s. is to be minimized with respect to the order parameters. Non-trivial results
can only be expected if the effective temperature $(\beta P / N)$ is of order 1, i.e. the high training
temperature has to be compensated for by a large number of examples $P \propto N / \beta$. Consequently,
training energy $\epsilon_t$ and generalization error $\epsilon_g$ coincide in this limit. Eq. (7) contains the entropy
term

$$s = \frac{1}{N} \ln \int d\mu(\{J^{(j)}\}) \prod_{i,j} \left[ \delta\left( J^{(i)} \cdot B^{(j)} - N R_{ij} \right) \delta\left( J^{(i)} \cdot J^{(j)} - N Q_{ij} \right) \right] \qquad (8)$$



which can be evaluated by means of a saddle point integration after rewriting the $\delta$-functions
in their integral representation. One obtains $s(\{Q_{ij}, R_{ij}\}) = \ln[\det \mathcal{C}]/2 + \mathrm{const}$, where $\mathcal{C}$ is the
$(2K \times 2K)$-matrix of all cross- and self-overlaps of the vectors $\{J^{(j)}, B^{(j)}\}$. The constant term is
independent of the order parameters and therefore irrelevant. 
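
The entropy term can be evaluated numerically as half the log-determinant of the overlap matrix. The sketch below builds the $(2K \times 2K)$ matrix from the student-student overlaps $Q$ and the student-teacher overlaps $R$, assuming orthonormal teacher vectors (teacher block equal to the unit matrix), and uses plain Gaussian elimination. The helper names `log_det_spd` and `entropy` are illustrative.

```python
import math

def log_det_spd(M):
    """ln det of a symmetric positive definite matrix via Gaussian elimination."""
    A = [row[:] for row in M]
    n = len(A)
    log_det = 0.0
    for k in range(n):
        pivot = A[k][k]
        if pivot <= 0.0:
            raise ValueError("overlap matrix is not positive definite")
        log_det += math.log(pivot)
        for i in range(k + 1, n):
            factor = A[i][k] / pivot
            for j in range(k, n):
                A[i][j] -= factor * A[k][j]
    return log_det

def entropy(Q, R):
    """s = ln[det C]/2 (constant dropped) with C = [[Q, R], [R^T, 1]]."""
    K = len(Q)
    C = [[0.0] * (2 * K) for _ in range(2 * K)]
    for i in range(K):
        for j in range(K):
            C[i][j] = Q[i][j]        # student-student overlaps
            C[i][K + j] = R[i][j]    # student-teacher overlaps
            C[K + j][i] = R[i][j]    # transposed block
        C[K + i][K + i] = 1.0        # orthonormal teacher vectors (assumption)
    return 0.5 * log_det_spd(C)

# A student uncorrelated with the teacher and with itself has maximal entropy s = 0.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
print(entropy(identity, zero))
```

The entropy decreases towards minus infinity as the student approaches the teacher, since the overlap matrix then becomes singular; the elimination raises an error for overlaps that no set of vectors can realize.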

In order to proceed with the analysis, we assume that the equilibrium student configuration is 
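symmetric with respect to the hidden units:

$$R_{ij} = R \, \delta_{ij} + S \, (1 - \delta_{ij}) , \qquad Q_{ij} = \delta_{ij} + C \, (1 - \delta_{ij}) . \qquad (9)$$

This assumption reflects the symmetry of the rule, yet allows for the specialization of student
nodes: for $R > S$ each of them has achieved a larger overlap with exactly one of the teacher
vectors. In the limiting case $R = 1$, $S = C = 0$ the student is identical with the teacher and
generalizes perfectly ($\epsilon_g = 0$). Now entropy (apart from irrelevant constants) and generalization
error read

$$s = \frac{1}{2} \ln\left[ 1 + (K-1) C - \left( R + (K-1) S \right)^2 \right] + \frac{K-1}{2} \ln\left[ 1 - C - (R - S)^2 \right] , \qquad (10)$$

$$\epsilon_g = \frac{1}{3} + \frac{K-1}{\pi} \left[ \arcsin\frac{C}{2} - 2 \arcsin\frac{S}{2} \right] - \frac{2}{\pi} \arcsin\frac{R}{2} . \qquad (11)$$

Defining $\alpha = \beta P / (NK)$, the rescaled number of examples per student weight, we have
minimized $f = \alpha (K \epsilon_g) - s$ with respect to the three order parameters $R$, $C$, $S$ numerically for different
network sizes.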
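
The numerical minimization can be sketched as follows. The explicit expressions in `free_energy` are assumptions: they follow from inserting the hidden-unit symmetric ansatz ($R$ on the diagonal and $S$ off the diagonal of the student-teacher overlaps, $C$ for the student-student cross overlaps) into $s = \ln[\det \mathcal{C}]/2$ and into the arcsin form of $\epsilon_g$ for the erf activation. The compass (coordinate pattern) search is a stand-in chosen to keep the example self-contained; the text does not state which numerical method was actually used. The search runs in the variables $\Delta = R - S$, $m = R + (K-1)S$ and $C$.

```python
import math

def free_energy(alpha, K, R, S, C):
    """f = alpha * (K * eps_g) - s for the site-symmetric ansatz (inf outside the domain)."""
    a = 1.0 - C - (R - S) ** 2                       # (K-1)-fold eigenvalue
    b = 1.0 + (K - 1) * C - (R + (K - 1) * S) ** 2   # remaining eigenvalue
    if a <= 0.0 or b <= 0.0 or max(abs(R), abs(S), abs(C)) >= 2.0:
        return math.inf
    eps_g = (1.0 / 3.0
             + (K - 1) / math.pi * (math.asin(C / 2.0) - 2.0 * math.asin(S / 2.0))
             - 2.0 / math.pi * math.asin(R / 2.0))
    s = 0.5 * (K - 1) * math.log(a) + 0.5 * math.log(b)
    return alpha * K * eps_g - s

def minimize_free_energy(alpha, K, start=(0.2, 0.8, 0.0), step=0.1, tol=1e-6):
    """Compass search in (Delta, m, C); returns (R, S, C, f) at the local minimum found."""
    def f(x):
        delta, m, C = x
        S = (m - delta) / K
        return free_energy(alpha, K, S + delta, S, C)

    x = list(start)
    fx = f(x)
    while step > tol:
        improved = False
        for i in range(3):
            for sign in (1.0, -1.0):
                trial = x[:]
                trial[i] += sign * step
                ft = f(trial)
                if ft < fx:              # accept any strict improvement
                    x, fx, improved = trial, ft, True
        if not improved:
            step *= 0.5                  # refine the mesh when no move helps
    delta, m, C = x
    S = (m - delta) / K
    return S + delta, S, C, fx

for alpha in (15.0, 40.0):
    R, S, C, f_min = minimize_free_energy(alpha, K=2)
    print(alpha, round(R - S, 3))
```

For $K = 2$ the search started from a slightly specialized configuration relaxes to $R = S$ at $\alpha = 15$ and to a clearly specialized state at $\alpha = 40$, in line with a transition between these two values. For $K \geq 3$ the outcome of such a local search depends on the starting point, because specialized and unspecialized minima can coexist.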

For $K = 2$ we find that the equilibrium configuration is characterized by $R = S$ for
$\alpha < \alpha_c^{(2)} \approx 23.7$, whereas above this critical value the only solution obeys $\Delta = |R - S| > 0$. The system
undergoes a second order phase transition and the specialization $\Delta$ increases close to the critical
point like $\Delta \propto (\alpha - \alpha_c^{(2)})^x$ where $x = 0.5$. Figure 1 shows that this spontaneous symmetry breaking
translates into a kink in the learning curve $\epsilon_g(\alpha)$.

The picture is qualitatively different for all $K \geq 3$ where we observe a first order transition.
Again, for small $\alpha$ the equilibrium solution is unspecialized ($R = S$). Then, for $\alpha > \alpha_s$ a locally
stable solution with $\Delta > 0$ and significantly lower generalization error appears. It becomes the
global minimum at a critical value $\alpha_c$ where the free energies of the two solutions coincide.
Finally, for $\alpha > \alpha_d$, the local minimum with zero specialization disappears. Figure 1 (a) shows
the learning curve for $K = 5$ as an example. A more detailed description of the dependence of
order parameters on $\alpha$ will be given in a forthcoming publication.

Figure 1: a) Learning curves for small $K$. The system with $K = 2$ undergoes a second
order transition at $\alpha_c^{(2)} \approx 23.7$ leading to the kink in $\epsilon_g(\alpha)$ (marked by the circle). For $K \geq 3$ the
transition is first order. Here we display the case $K = 5$; solid lines correspond to the globally
stable solution, dashed lines denote local minima ($\alpha_s^{(5)} \approx 44.3$, $\alpha_c^{(5)} \approx 46.6$ and $\alpha_d^{(5)} \approx 62.8$).
b) The learning curve in the limit $K \to \infty$. The specialized state is a local minimum (dashed) for
$\alpha > \alpha_s^{(\infty)} \approx 61.0$ and becomes globally stable (solid) at $\alpha_c^{(\infty)} \approx 69.1$. The unspecialized configuration
remains locally stable up to $\alpha_d = 4\pi K$ for large $K$. The inset displays the initial (unspecialized)
decrease of $\epsilon_g$ with $\beta P/N$.

The difference in behavior between networks with $K \geq 3$ and $K = 2$ is due to the higher degree
of symmetry in the latter case. The permutation symmetry of hidden units results in a free energy,
Eqs. (10, 11), which is invariant under exchange of $R$ and $S$ only for $K = 2$. It is interesting to note
that also the analysis of on-line learning in soft-committees has revealed qualitatively different
features for $K = 2$ and $K \geq 3$, see the discussion of the fixed point structure in ||.

Continuous Monte Carlo simulations of the learning process confirm our findings qualitatively 
and show that the basic features of the specialization process remain the same for relatively low 
temperatures. Figure 2 (a) displays the density of observed student teacher overlaps for $K = 2$
close to equilibrium in the unspecialized and specialized phase ($\alpha = 15$ and $\alpha = 40$ respectively).
In panel (b) the corresponding histograms are plotted for a network with $K = 4$ at $\alpha = 30$ and
$\alpha = 60$. Note that the total weight of overlaps $S$ should be a factor $(K - 1)$ larger than the
contribution of type $R$ when hidden unit symmetry holds. This is confirmed very well and justifies
the simplifying assumption (9).

The behavior of very large networks in the limit $K \to \infty$ (but $K \ll N$) has been studied
in [14] for the transfer function $g(x) = \tanh(x)$ within the annealed approximation. We repeat
the discussion for $\beta \to 0$ for completeness. Note that the analysis simplifies significantly due to
the choice $g(x) = \mathrm{erf}(x/\sqrt{2})$. Different regimes have to be distinguished, in analogy to previous
studies of large multilayered networks [14-18]. First, we assume that $\beta P / N = \alpha K = O(1)$ and
find that only an unspecialized solution exists in this initial phase of the learning process. The
ansatz $S = \hat{S}/K$ and $C = \hat{C}/(K-1)$ yields a simplified free energy of the form
$f = (\alpha K / \pi)(\hat{C}/2 - \hat{S}) - \ln[1 + \hat{C} - \hat{S}^2]/2 + \hat{C}/2 + O(1/K^2)$ which is minimized for

$$\hat{S} = \frac{\alpha K}{\alpha K + \pi} \quad \text{and} \quad \hat{C} = -\frac{\pi \alpha K}{(\alpha K + \pi)^2} . \qquad (12)$$

The corresponding generalization error is plotted in the inset of Figure 1 (b) vs. $\alpha K$ and approaches
the stationary non-zero value $\epsilon_g = 1/3 - 1/\pi$ for large $\alpha K$.
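
The unspecialized large-$K$ branch can be checked numerically. The sketch below assumes the leading-order free energy $f = (x/\pi)(\hat{C}/2 - \hat{S}) - \ln[1 + \hat{C} - \hat{S}^2]/2 + \hat{C}/2$ with $x = \alpha K = \beta P/N$ (an additive constant is dropped, and the factor $1/\pi$ is inferred here from consistency with the minimizer) and the corresponding leading-order error $\epsilon_g = 1/3 + (\hat{C}/2 - \hat{S})/\pi$. The function names are illustrative.

```python
import math

def large_k_unspecialized(x):
    """Rescaled order parameters (S_hat, C_hat) of Eq. (12) for x = alpha * K."""
    s_hat = x / (x + math.pi)
    c_hat = -math.pi * x / (x + math.pi) ** 2
    return s_hat, c_hat

def large_k_free_energy(x, s_hat, c_hat):
    """Leading-order free energy of the unspecialized ansatz (constant dropped)."""
    return ((x / math.pi) * (c_hat / 2.0 - s_hat)
            - 0.5 * math.log(1.0 + c_hat - s_hat ** 2)
            + c_hat / 2.0)

def large_k_eps_g(x):
    """Leading-order generalization error on the unspecialized branch."""
    s_hat, c_hat = large_k_unspecialized(x)
    return 1.0 / 3.0 + (c_hat / 2.0 - s_hat) / math.pi

# The error starts at 1/3 and saturates at 1/3 - 1/pi for many examples.
for x in (0.0, 1.0, 10.0, 100.0):
    print(x, large_k_eps_g(x))
```

Under these assumptions the gradient of the free energy with respect to $\hat{S}$ and $\hat{C}$ vanishes at the values of Eq. (12), and the error decreases from 1/3 towards the plateau value $1/3 - 1/\pi$.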

Figure 2: Empirical densities of observed student teacher overlaps in Monte Carlo simulations of
the learning process at inverse temperature $\beta = 0.8$. The system size is $N = 50$; after 50000 Monte
Carlo steps allowed for equilibration, another 50000 were performed while sampling the densities.
Panel a: $K = 2$, $\alpha = 15$ (left) and $\alpha = 40$ (right), panel b: $K = 4$, $\alpha = 30$ (left) and $\alpha = 60$
(right).

Consequently, the unspecialized configuration is given by $\hat{S} = 1$ and $\hat{C} = 0$ to first order in
$1/K$ when $\alpha = \beta P / (NK)$ is of order one. However, specialization is possible in this regime and
including a nonzero $\Delta = R - S = O(1)$ in the above ansatz yields a solution with $\hat{S} = KS = 1 - \Delta$
and $\hat{C} = KC = 0$ to first order. The corresponding specialization $\Delta$ is the largest positive solution
of

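$$\alpha = \frac{\pi \Delta \sqrt{4 - \Delta^2}}{(1 - \Delta^2) \left( 2 - \sqrt{4 - \Delta^2} \right)} , \qquad (13)$$

which exists only for $\alpha > \alpha_s \approx 60.99$. The critical value $\alpha_c \approx 69.09$ is characterized by
coinciding free energies of the specialized and unspecialized solution.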
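
The two characteristic values can be estimated numerically. The sketch below assumes that the specialization satisfies $\alpha = \pi \Delta \sqrt{4 - \Delta^2} / [(1 - \Delta^2)(2 - \sqrt{4 - \Delta^2})]$ and that, to leading order in $1/K$, the free energy per hidden unit is $\alpha [1/3 - (1 - \Delta + 2 \arcsin(\Delta/2))/\pi] - \ln(1 - \Delta^2)/2$, with $\Delta = 0$ on the unspecialized branch. Both expressions are reconstructions from the ansatz $\hat{S} = 1 - \Delta$, $\hat{C} = 0$, and the function names are illustrative.

```python
import math

def alpha_of_delta(delta):
    """Rescaled number of examples at which a stationary state has specialization delta."""
    root = math.sqrt(4.0 - delta ** 2)
    return math.pi * delta * root / ((1.0 - delta ** 2) * (2.0 - root))

def free_energy_gap(delta):
    """(f_unspecialized - f_specialized) per hidden unit, to leading order in 1/K."""
    alpha = alpha_of_delta(delta)
    return ((alpha / math.pi) * (2.0 * math.asin(delta / 2.0) - delta)
            + 0.5 * math.log(1.0 - delta ** 2))

def spinodal(n=20000):
    """Smallest alpha with a specialized solution, found by a grid scan over delta in (0, 1)."""
    grid = [(i + 0.5) / n for i in range(n)]
    delta_s = min(grid, key=alpha_of_delta)
    return delta_s, alpha_of_delta(delta_s)

def coexistence(delta_lo, delta_hi=0.999, iters=60):
    """Bisection for the delta at which both branches have equal free energy."""
    for _ in range(iters):
        mid = 0.5 * (delta_lo + delta_hi)
        if free_energy_gap(mid) < 0.0:
            delta_lo = mid   # specialized branch still has the higher free energy
        else:
            delta_hi = mid
    delta_c = 0.5 * (delta_lo + delta_hi)
    return delta_c, alpha_of_delta(delta_c)

delta_s, alpha_s = spinodal()
delta_c, alpha_c = coexistence(delta_s)
print(round(alpha_s, 2), round(alpha_c, 2))
```

Restricting the bisection to $\Delta$ above the minimum of `alpha_of_delta` selects the largest positive solution, i.e. the locally stable specialized branch. Under the stated assumptions the two values come out close to the quoted 60.99 and 69.09.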

In the specialized phase, the generalization error decreases asymptotically like $\epsilon_g = 2/\alpha$ for
$\alpha \to \infty$ which holds true also for general (small) $K$. The local minimum with $\Delta = 0$ remains
locally stable even in the limit $\alpha \to \infty$. However, we can show that the metastable state disappears
at $\alpha_d = 4\pi K$ when $K$ is large. The latter result is an extremely good approximation for $K$ as
small as 4 or 5 already.

In summary, we have used equilibrium statistical physics to analyse off-line learning in two- 
layered soft-committee machines. First results for finite K are obtained in the high temperature 
limit which allows the quenched free energy to be calculated analytically. Specifically, we have studied
networks with $K$ hidden units learning a perfectly matching rule. For $K = 2$ we find a second order
phase transition from unspecialized to specialized student configuration at a critical number of
examples. In scenarios with $K \geq 3$ the transition from poor to good generalization is first order.
Monte Carlo simulations confirm the existence and nature of the transitions also for moderately
low temperatures. The analysis of the limit $K \to \infty$ shows that a critical number $P \propto NK/\beta$ of
examples is needed for specialization, in agreement with the earlier findings of Kang et al. [14].
As a novel result we observe that the metastable unspecialized state persists up to $P = 4\pi N K^2 / \beta$
for large $K$.

Our results (for $K \geq 3$) seem to parallel to a large extent the findings of [14-16] for hard
committee machines. We do not expect this correspondence to extend to very low temperatures,
however. In the limit $T \to 0$, a transition to perfect generalization should occur after presenting
on the order of $NK$ examples, which is analogous to the results of [13] for a single continuous
node. This would not necessarily imply the existence of a corresponding practical algorithm. Note,
however, that already the computationally cheap on-line gradient descent realizes an exponential
decay of $\epsilon_g$ with $\alpha = P/(NK)$ as opposed to the much slower algebraic decay observed here.

It will therefore be necessary to complete the picture by extending the analysis into the low tem- 
perature regime, e.g. within the annealed approximation. The application of the replica formalism 
should be possible for large networks in analogy to the work of [ [ToT , [l7| . Further investigations will 
address unlearnable rules as well as over-sophisticated students \\7[ |l§| . The introduction of a 
weight decay term to the training energy allows one to relax the somewhat unnatural a priori
normalization of student weights. First results concern a single unit and show a non-trivial dependence
of the performance on the weight decay parameter.

We would like to thank W. Kinzel, G. Reents, and R. Urbanczik for stimulating discussions, 
and J. Hertz for drawing our attention to the analysis of large soft-committees in [14].